6. Using Gaussian Processes

  • hyperparameter tuning

  • application to case study

  • interpretability evaluation

Learning outcomes

  1. Describe the theoretical foundation of intrinsically interpretable models like sparse regression, Gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.

Reading

  • Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A visual exploration of Gaussian processes. Distill, 4(4). doi:10.23915/distill.00017

  • Deisenroth, M., Luo, Y., & van der Wilk, M. (2020). A practical guide to Gaussian processes. https://infallible-thompson-49de36.netlify.app/.

Hyperparameter Optimization

Marginal Likelihood

Q: How do we choose the kernel hyperparameters \(\theta = \left(\ell, \sigma_f, \dots\right)\)?

A: Find \(\theta\) that makes the observed data \(\mathbf{y}\) most probable:

Marginal likelihood

\[p(\mathbf{y} | \mathbf{X}, \theta) = \int p(\mathbf{y} | \mathbf{f}, \mathbf{X}) p(\mathbf{f} | \mathbf{X}, \theta) d\mathbf{f}\]

This “marginalizes” over all possible functions \(\mathbf{f}\) with hyperparameter \(\theta\).

Marginal Likelihood

For the GP model, this has a closed form:

\[\log p(\mathbf{y} | \mathbf{X}, \theta) = -\frac{1}{2}\mathbf{y}^\top(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2\mathbf{I}| + \text{const}\]

This has two competing terms – data fit vs. model complexity.
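The closed form above can be implemented directly. Below is a minimal numpy sketch assuming an RBF kernel; the function and parameter names are illustrative, and the Cholesky factorization supplies the log-determinant term cheaply and stably.

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel: sigma_f^2 exp(-d^2 / (2 ell^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def log_marginal_likelihood(X, y, ell, sigma_f, sigma_n):
    """log p(y | X, theta) for a zero-mean GP with Gaussian noise."""
    n = len(y)
    K = rbf_kernel(X, X, ell, sigma_f) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)                 # K = L L^T (stable inversion)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha               # -1/2 y^T (K + sigma_n^2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))  # -1/2 log|K + sigma_n^2 I|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)
```

Maximizing this quantity over \(\theta\), typically in log space by gradient ascent, is the standard way GP hyperparameters are fit.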

Automatic Occam’s Razor

Consider fitting a sine wave with noise. How would you choose the RBF \(\ell\)?

Very small \(\ell\) (overfitting)

  • Will interpolate the noise, not just the curve
  • Spreads probability over too many datasets
  • Low marginal likelihood

Very large \(\ell\) (underfitting)

  • Will oversmooth the sine component
  • Low marginal likelihood

The right choice of \(\ell\) will be flexible enough to fit the sine curve but not the noise. The marginal likelihood finds this automatically without any cross-validation.
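This can be seen numerically by sweeping \(\ell\) over a grid on synthetic noisy sine data and evaluating the log marginal likelihood at each value; the maximizer lands at an intermediate lengthscale. A self-contained sketch, where the grid bounds and noise level are illustrative choices:

```python
import numpy as np

def lml(X, y, ell, sigma_f=1.0, sigma_n=0.3):
    # Log marginal likelihood of a zero-mean GP with an RBF kernel.
    K = sigma_f**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2)
    K += sigma_n**2 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, 40))
y = np.sin(X) + 0.3 * rng.standard_normal(40)  # sine wave + noise

ells = np.logspace(-2, 2, 50)                  # sweep lengthscales
scores = [lml(X, y, l) for l in ells]
best = ells[np.argmax(scores)]                 # neither tiny nor huge
```

No cross-validation split is needed: the tiny-\(\ell\) and huge-\(\ell\) extremes are both penalized by the marginal likelihood itself.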

Marginal Likelihood Surface

  • Non-convex: multiple local optima
  • Each optimum corresponds to different explanation of data
  • Gradient-based optimization sensitive to initialization
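In practice this motivates running the optimizer from several random initializations and keeping the best local optimum. A sketch using `scipy.optimize.minimize` with Nelder-Mead; optimizing in log space keeps the hyperparameters positive, and the restart count of 5 is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(log_theta, X, y):
    # theta = (log ell, log sigma_f, log sigma_n); work in log space.
    ell, sigma_f, sigma_n = np.exp(log_theta)
    K = sigma_f**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2)
    K += sigma_n**2 * np.eye(len(y))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return np.inf                      # numerically degenerate theta
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, 30))
y = np.sin(X) + 0.2 * rng.standard_normal(30)

# Several random initializations; keep the best local optimum.
results = [minimize(neg_lml, rng.normal(0, 1, 3), args=(X, y),
                    method="Nelder-Mead") for _ in range(5)]
best = min(results, key=lambda r: r.fun)
ell_hat, sigma_f_hat, sigma_n_hat = np.exp(best.x)
```

Comparing the restarts' final objective values also reveals how many distinct local optima the surface has for a given dataset.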

Initializing Signal Variance

Observation = Signal + Noise

\[\begin{align*} y = f(x) + \epsilon \end{align*}\]

Variance Decomposition

\[\begin{align*} \underbrace{\operatorname{Var}[y]}_{\text {what we see }}=\underbrace{\sigma_f^2}_{\text {signal }}+\underbrace{\sigma_n^2}_{\text {noise }} \end{align*}\]

Initializing Signal Variance

Heuristic 1: If the instrument’s precision is known (e.g., Kepler telescope noise \(\sigma_n \approx 10^{-4}\)), then set \[\begin{align*} \sigma_f^2=\operatorname{Var}[y]-\sigma_n^2 \end{align*}\]

Heuristic 2: Suppose the signal dominates by some assumed SNR factor \(\kappa\) (e.g., \(\kappa \in [2, 100]\)). Then set \[\begin{align*} \sigma_{f}^{2} \approx \operatorname{Var}[y]\\ \sigma_{n}^{2} \approx \frac{\sigma_f^2}{\kappa^2} \end{align*}\]
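Both heuristics amount to a couple of lines. A sketch on synthetic data, where the noise level and SNR factor are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.sin(np.linspace(0, 10, 200)) + 0.05 * rng.standard_normal(200)

# Heuristic 1: instrument precision is known.
sigma_n = 0.05                        # e.g. documented sensor noise
sigma_f2 = np.var(y) - sigma_n**2     # leftover variance is signal

# Heuristic 2: assume the signal dominates by an SNR factor kappa.
kappa = 10.0
sigma_f2_b = np.var(y)
sigma_n2_b = sigma_f2_b / kappa**2
```

Either way the result is only an initialization; the marginal likelihood optimization refines it.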

Initializing Lengthscale

Lengthscale controls how far we travel before \(f(x)\) and \(f(x')\) decorrelate

Heuristic: Set lengthscale relative to spread of input data \[\ell \approx \lambda \cdot \text{SD}[\mathbf{X}], \quad \lambda \in [0.2, 10]\]

Alternative: Use median distance from mean \[\ell \approx \text{median}\{|x_i - \bar{x}|\}\]
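Both lengthscale heuristics, sketched on synthetic inputs; \(\lambda = 1\) is an arbitrary choice within the suggested range:

```python
import numpy as np

X = np.sort(np.random.default_rng(4).uniform(0, 10, 100))

# Heuristic: a multiple of the input spread.
lam = 1.0                                # lambda in [0.2, 10]
ell_sd = lam * np.std(X)

# Alternative: median absolute distance from the mean.
ell_med = np.median(np.abs(X - X.mean()))
```

For inputs spread over \([0, 10]\), both heuristics land near \(\ell \approx 2.5\text{–}3\), a sensible starting point for optimization.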

Case Study

Discussion

Respond to [GP True/False] parts (a - c) in the exercise sheet.

Select all the TRUE statements about GPs below.

  1. TRUE FALSE The posterior variance at a test point x* is always smaller than the prior variance at that point, regardless of where the training data are located.

  2. TRUE FALSE Multiplying two kernels \(k^{\text{new}}(x, x') = k_{1}(x, x')k_{2}(x, x')\) always results in a valid positive semi-definite kernel.

  3. TRUE FALSE The posterior variance at a test point \(x^*\) is always less than or equal to the prior variance at that point, regardless of the distances to the training data.

Case Study Takeaways

The kernel is a hypothesis about the data generating process.

  • Quasiperiodicity \(\to\) rotating star with evolving spots
  • Not an arbitrary computational choice

What we learned

  • Composing scientifically-informed kernels from elementary components
  • Defining hyperparameter priors and model likelihood in R
  • Visualizing posterior samples of \(\mathbf{f}\) and \(\theta\)

The GP is not just a curve-fitting routine; it helps discover meaningful astrophysics.
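For example, a quasiperiodic kernel can be composed as a periodic kernel multiplied by an RBF envelope: the period captures the rotation, while the envelope lets the spot pattern evolve. A numpy sketch (the case study itself used R, and all hyperparameter values here are illustrative):

```python
import numpy as np

def quasiperiodic_kernel(X1, X2, sigma_f, ell_decay, period, ell_p):
    """Periodic kernel times RBF envelope: a repeating pattern whose
    shape slowly evolves, e.g. spots on a rotating star."""
    d = X1[:, None] - X2[None, :]
    periodic = np.exp(-2 * np.sin(np.pi * d / period)**2 / ell_p**2)
    envelope = np.exp(-0.5 * d**2 / ell_decay**2)
    return sigma_f**2 * periodic * envelope

t = np.linspace(0, 30, 50)
K = quasiperiodic_kernel(t, t, sigma_f=1.0, ell_decay=20.0,
                         period=5.0, ell_p=0.7)
```

Because the elementwise product of two valid kernels is itself a valid kernel, the composite stays positive semi-definite, and each factor keeps its scientific reading.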

Interpretability

Interpretability Criteria

  • Predictive accuracy: Can we approximate complex functions?
  • Parameter meaning: Do parameters correspond to scientific quantities?
  • Uncertainty: Does model know what it doesn’t know?
  • Visualization: Can you draw what the model believes?
  • Falsifiability: Can you identify when the model is wrong?

GPs do well on all five; deep learning typically fails on criteria 2 through 5.

Comparison with Linear Regression

Linear regression

  • Parametric: \(y = \beta_0 + x^\top \beta\).
  • Interpretable: “\(\beta_{1}\) is the effect of \(x_{1}\) all else held equal”
  • Global: same \(\beta\) everywhere
  • Limited: cannot capture nonlinearity without feature engineering

Gaussian processes

  • Nonparametric: no fixed functional form
  • Interpretable Hyperparameters: \(\ell\) (scale), \(p\) (period), …
  • Local: predictions weighted by kernel
  • Flexible: captures complex patterns

Nonparametric but Interpretable

  • Nonparametric \(\to\) complexity grows as the number of samples increases
  • Not “effect sizes” but “characteristic scales”
  • “How quickly does \(f\) vary?” not “How much does \(x\) affect \(y\)?”

Both GPs and linear regression are interpretable, but in different senses.

Limitations and Extensions

Limitations:

  • 1D time series only
  • Assumes Gaussian observation noise
  • Stationary/periodic assumptions

Extensions:

  • Non-Gaussian observations: Maybe we want a Student-t likelihood for robustness or softmax layer for classification.
  • Multidimensional inputs or outputs: We might have other relevant measurements for each star, and may want to model several stars at once.